In this practical, a number of R packages are used. If any of them are not installed you can still follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
- mice (version 3.4.0)
- visdat (version 0.5.3)
- JointAI (version 0.5.1)
- VIM (version 4.8.0)
- plyr (version 1.8.4)
- corrplot (version 0.84)

You can find the help file for any function by adding a ? before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this example, we will use the NHANES dataset. To load this dataset, you can use the command load(file.choose()), which opens the file browser and allows you to navigate to the location of the file NHANES_for_practicals.RData on your computer. If you know the path to the file, you can also use load("<path>/NHANES_for_practicals.RData") directly.
Let’s take a first look at the data. Useful functions are dim(), head(), str() and summary().
## [1] 1000 24
## 'data.frame': 1000 obs. of 24 variables:
## $ HDL : num 1.58 1.24 1.71 1.01 1.09 NA NA 1.16 1.16 1.5 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 2 5 3 3 3 3 2 3 1 3 ...
## $ DBP : num 56.7 81.3 70 66 69.3 ...
## $ bili : num 0.5 0.9 0.7 0.4 0.9 NA NA 0.9 0.6 0.9 ...
## $ smoke : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 3 1 3 2 1 2 2 3 ...
## $ DM : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ gender : Factor w/ 2 levels "male","female": 2 1 1 2 1 2 1 1 1 1 ...
## $ WC : num 64.5 81.4 76 139.5 94.6 ...
## $ chol : num 4.29 4.27 4.22 3.96 4.97 NA NA 5.2 5.56 4.86 ...
## $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: NA NA NA NA NA 3 3 NA NA NA ...
## $ alc : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 2 NA NA 2 4 1 1 5 3 4 ...
## $ SBP : num 105 125 133 141 117 ...
## $ wgt : num 46.3 63.5 62.1 113.9 102.1 ...
## $ hypten : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 1 1 1 ...
## $ cohort : chr "2011" "2011" "2011" "2011" ...
## $ occup : Factor w/ 3 levels "working","looking for work",..: 1 1 1 1 1 3 3 1 1 1 ...
## $ age : num 22 39 51 45 31 75 73 52 29 40 ...
## $ educ : Factor w/ 5 levels "Less than 9th grade",..: 4 5 5 5 3 1 2 4 3 1 ...
## $ albu : num 3.8 4.3 4.3 3.6 4.9 NA NA 3.9 4.9 4.3 ...
## $ creat : num 0.61 0.87 0.89 0.61 0.83 NA NA 0.91 0.93 0.94 ...
## $ uricacid: num 3.6 5.4 6.2 4.3 6.1 NA NA 7.8 3.9 4.9 ...
## $ BMI : num 19.3 19.5 22.1 41.8 32.3 ...
## $ hypchol : Factor w/ 2 levels "no","yes": 1 1 1 1 1 NA NA 1 1 1 ...
## $ hgt : num 1.55 1.8 1.68 1.65 1.78 ...
## HDL race DBP bili smoke DM
## Min. :0.360 Mexican American :112 Min. : 12.67 Min. :0.2000 never :608 no :853
## 1st Qu.:1.110 Other Hispanic :110 1st Qu.: 64.00 1st Qu.:0.6000 former :191 yes:147
## Median :1.320 Non-Hispanic White:350 Median : 70.67 Median :0.7000 current:199
## Mean :1.391 Non-Hispanic Black:229 Mean : 70.80 Mean :0.7527 NA's : 2
## 3rd Qu.:1.580 other :199 3rd Qu.: 77.33 3rd Qu.:0.9000
## Max. :4.030 Max. :130.00 Max. :2.9000
## NA's :86 NA's :59 NA's :95
## gender WC chol HyperMed alc SBP
## male :493 Min. : 61.90 Min. : 2.070 no : 25 0 :113 Min. : 81.33
## female:507 1st Qu.: 85.00 1st Qu.: 4.270 previous: 20 <=1 :281 1st Qu.:109.33
## Median : 95.10 Median : 4.910 yes :142 1-3 :105 Median :118.00
## Mean : 96.35 Mean : 4.998 NA's :813 3-7 : 81 Mean :120.15
## 3rd Qu.:105.50 3rd Qu.: 5.610 >7 :101 3rd Qu.:128.67
## Max. :154.70 Max. :10.680 NA's:319 Max. :202.00
## NA's :49 NA's :86 NA's :59
## wgt hypten cohort occup age
## Min. : 39.01 no :693 Length:1000 working :544 Min. :20.00
## 1st Qu.: 63.50 yes :265 Class :character looking for work: 46 1st Qu.:31.00
## Median : 76.88 NA's: 42 Mode :character not working :393 Median :43.00
## Mean : 78.35 NA's : 17 Mean :45.23
## 3rd Qu.: 89.13 3rd Qu.:59.00
## Max. :167.83 Max. :79.00
## NA's :22
## educ albu creat uricacid BMI
## Less than 9th grade : 83 Min. :3.000 Min. :0.4400 Min. :2.300 Min. :15.34
## 9-11th grade :133 1st Qu.:4.100 1st Qu.:0.6900 1st Qu.:4.400 1st Qu.:23.18
## High school graduate:228 Median :4.300 Median :0.8200 Median :5.300 Median :26.58
## some college :283 Mean :4.289 Mean :0.8525 Mean :5.356 Mean :27.49
## College or above :272 3rd Qu.:4.500 3rd Qu.:0.9500 3rd Qu.:6.200 3rd Qu.:30.73
## NA's : 1 Max. :5.400 Max. :7.4600 Max. :9.900 Max. :60.54
## NA's :91 NA's :91 NA's :92 NA's :37
## hypchol hgt
## no :813 Min. :1.397
## yes :101 1st Qu.:1.626
## NA's: 86 Median :1.676
## Mean :1.685
## 3rd Qu.:1.753
## Max. :1.981
## NA's :22
It is important to check that all variables are coded correctly, i.e., have the correct class. Imputation software (e.g., the mice package) uses the class to automatically select imputation methods. When importing data from other software it can happen that factors become continuous variables or that ordered factors lose their ordering.
str() showed that in the NHANES data smoke and alc are correctly specified as ordinal variables, but educ is an unordered factor.
Using levels(NHANES$educ) we can print the names of the categories of educ. Convert the unordered factor to an ordered factor, for example using as.ordered(). Afterwards, check if the conversion was successful.
## [1] "Less than 9th grade" "9-11th grade" "High school graduate" "some college"
## [5] "College or above"
## Ord.factor w/ 5 levels "Less than 9th grade"<..: 4 5 5 5 3 1 2 4 3 1 ...
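In base R, the conversion amounts to the following (a self-contained sketch with made-up values; in the practical you would apply the same to NHANES$educ):

```r
# Hypothetical example: convert an unordered factor to an ordered factor.
# The levels are listed in their natural order when creating the factor.
educ <- factor(c("some college", "College or above", "9-11th grade"),
               levels = c("Less than 9th grade", "9-11th grade",
                          "High school graduate", "some college",
                          "College or above"))
educ <- as.ordered(educ)   # e.g. NHANES$educ <- as.ordered(NHANES$educ)
is.ordered(educ)           # TRUE: the levels are now treated as ordered
```

Note that as.ordered() keeps the existing level order, so make sure the levels are already in the correct order before converting.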
In the summary() we could already see that there are missing values in several variables. The missing data pattern can be obtained and visualized by several functions from different packages. Examples are
- md.pattern() from package mice
- md_pattern() from package JointAI (with argument pattern = TRUE)
- aggr() from package VIM
- vis_dat() and vis_miss() from package visdat

Explore the missing data pattern of the NHANES data.
## race DM gender cohort age educ smoke occup wgt hgt BMI hypten WC DBP SBP HDL chol hypchol albu
## 33 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 527 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 75 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 159 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0
## 11 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 12 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0
## 2 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1
## 4 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
## 0 0 0 0 0 1 2 17 22 22 37 42 49 59 59 86 86 86 91
## creat uricacid bili alc HyperMed
## 33 1 1 1 1 1 0
## 527 1 1 1 1 0 1
## 75 1 1 1 0 1 1
## 159 1 1 1 0 0 2
## 2 1 1 0 1 0 2
## 1 1 1 0 0 0 3
## 1 1 0 0 1 0 3
## 3 0 0 0 1 0 5
## 2 0 0 0 0 1 5
## 15 0 0 0 1 1 7
## 21 0 0 0 1 0 8
## 8 0 0 0 0 1 8
## 12 0 0 0 0 0 9
## 10 1 1 1 1 1 2
## 3 1 1 1 0 1 3
## 10 1 1 1 1 1 1
## 6 1 1 1 1 0 2
## 4 1 1 1 0 1 2
## 3 1 1 1 0 0 3
## 3 0 0 0 1 1 8
## 3 0 0 0 0 1 9
## 3 0 0 0 0 0 10
## 3 1 1 1 0 1 4
## 1 0 0 0 1 1 10
## 2 0 0 0 0 1 11
## 2 1 1 1 1 0 2
## 1 0 0 0 1 0 9
## 11 1 1 1 1 0 4
## 12 1 1 1 0 0 5
## 2 0 0 0 1 0 11
## 3 0 0 0 0 0 12
## 1 1 1 1 0 0 6
## 1 0 0 0 1 0 12
## 1 0 0 0 0 0 13
## 2 1 1 1 1 1 2
## 4 1 1 1 1 0 3
## 1 1 1 1 0 1 3
## 2 1 1 1 0 0 4
## 1 0 0 0 1 0 10
## 1 1 1 1 0 1 6
## 1 1 1 1 1 0 6
## 1 1 1 1 0 0 7
## 1 1 1 1 0 0 8
## 1 0 0 0 0 0 15
## 3 1 1 1 1 1 2
## 1 1 1 1 1 0 3
## 2 1 1 1 0 1 3
## 3 1 1 1 0 0 4
## 1 0 0 0 0 0 11
## 2 0 0 0 0 1 11
## 1 0 0 0 0 0 12
## 1 1 1 1 0 0 7
## 1 1 1 1 0 0 8
## 4 1 1 1 1 0 4
## 2 1 1 1 0 0 5
## 4 1 1 1 1 1 1
## 9 1 1 1 1 0 2
## 2 0 0 0 0 0 10
## 1 1 1 1 0 0 6
## 1 1 1 1 1 0 5
## 1 1 1 1 1 0 2
## 1 0 0 0 0 0 10
## 1 0 0 0 1 0 13
## 91 92 95 319 813 2069
md.pattern() from package mice gives us a matrix where each row represents one missing data pattern (1 = observed, 0 = missing). The rownames show how many rows of the dataset have the given pattern. The last column shows the number of missing values in each pattern, the last row the number of missing values per variable.
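What such a pattern summary computes can be sketched in base R on a small hypothetical dataset: each distinct row of the response indicator matrix (observed vs. missing) is one pattern.

```r
# Hypothetical data with missing values
dat <- data.frame(x = c(1, NA, 3, NA),
                  y = c(NA, 2, 3, NA))
R <- !is.na(dat)                        # TRUE = observed, FALSE = missing
pat <- apply(R, 1, paste, collapse = "/")
table(pat)                              # number of rows per missing data pattern
colSums(!R)                             # number of missing values per variable
```

md.pattern() additionally sorts the patterns and adds the row and column totals shown above.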
md.pattern() also plots the missing data pattern automatically.
## race DM gender cohort age educ smoke occup wgt hgt BMI hypten WC DBP SBP HDL chol hypchol albu
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 8 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 9 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 11 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 12 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 14 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 15 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 16 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 19 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 20 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 22 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 23 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 24 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 25 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 26 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 27 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 28 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 30 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 31 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 33 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 34 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
## 35 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 36 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 37 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 38 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 39 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 40 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## 41 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0
## 42 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 43 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 44 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1
## 45 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 46 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 47 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1
## 48 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
## 49 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0
## 50 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 51 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
## 52 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1
## 53 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 54 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 56 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1
## 57 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 58 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## 59 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## 60 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1
## 61 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 62 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 63 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0
## Nmis 0 0 0 0 0 1 2 17 22 22 37 42 49 59 59 86 86 86 91
## creat uricacid bili alc HyperMed Npat
## 1 1 1 1 1 0 527
## 2 1 1 1 0 0 159
## 3 1 1 1 0 1 75
## 4 1 1 1 1 1 33
## 5 0 0 0 1 0 21
## 6 0 0 0 1 1 15
## 7 0 0 0 0 0 12
## 8 1 1 1 0 0 12
## 9 1 1 1 1 0 11
## 10 1 1 1 1 1 10
## 11 1 1 1 1 1 10
## 12 1 1 1 1 0 9
## 13 0 0 0 0 1 8
## 14 1 1 1 1 0 6
## 15 1 1 1 0 1 4
## 16 1 1 1 1 1 4
## 17 1 1 1 1 0 4
## 18 1 1 1 1 0 4
## 19 0 0 0 1 1 3
## 20 1 1 1 0 0 3
## 21 1 1 1 0 1 3
## 22 0 0 0 0 0 3
## 23 1 1 1 0 1 3
## 24 1 1 1 1 1 3
## 25 0 0 0 1 0 3
## 26 0 0 0 0 1 3
## 27 0 0 0 0 0 3
## 28 1 1 1 0 0 3
## 29 0 0 0 0 1 2
## 30 0 0 0 0 1 2
## 31 0 0 0 1 0 2
## 32 1 1 0 1 0 2
## 33 1 1 1 0 1 2
## 34 0 0 0 0 0 2
## 35 1 1 1 0 0 2
## 36 1 1 1 1 0 2
## 37 0 0 0 0 1 2
## 38 1 1 1 1 1 2
## 39 1 1 1 0 0 2
## 40 0 0 0 0 0 1
## 41 0 0 0 0 0 1
## 42 0 0 0 0 0 1
## 43 1 1 1 0 0 1
## 44 1 1 1 0 0 1
## 45 1 1 1 0 0 1
## 46 1 1 0 0 0 1
## 47 1 1 1 1 0 1
## 48 1 1 1 0 0 1
## 49 0 0 0 1 0 1
## 50 1 1 1 0 1 1
## 51 0 0 0 1 0 1
## 52 1 1 1 0 0 1
## 53 0 0 0 1 0 1
## 54 1 1 1 1 0 1
## 55 0 0 0 0 0 1
## 56 1 1 1 0 0 1
## 57 1 0 0 1 0 1
## 58 1 1 1 1 0 1
## 59 0 0 0 0 0 1
## 60 1 1 1 0 1 1
## 61 1 1 1 1 0 1
## 62 0 0 0 1 1 1
## 63 0 0 0 1 0 1
## Nmis 91 92 95 319 813 2069
The function md_pattern() from package JointAI gives a matrix very similar to the one obtained from mice::md.pattern(). However, here the number of rows in the data that have a particular missing data pattern is given in the last column.
For more information on how to customize the visualization by md_pattern() see the vignette Visualizing Incomplete Data on the webpage of JointAI.
aggr() from VIM plots a histogram of the proportion of missing values per column as well as a visualization of the missing data pattern. Here, a small histogram on the right edge of the plot shows the proportion of cases in each pattern.
For more options on how to customize the plot, see the VIM documentation.
Now get an overview of

- the number and proportion of complete cases, and
- the number and proportion of missing values per variable.

Useful functions:

- complete.cases() gives you a vector that is TRUE if the case is complete and FALSE if it contains missing values.
- is.na() returns TRUE if a value is missing and FALSE if it is observed.
- colSums() calculates the sum of the values in each column of a data.frame or matrix.
- colMeans() calculates the mean of the values in each column of a data.frame or matrix.
# number and proportion of complete cases
cc <- table(ifelse(complete.cases(NHANES), 'complete', 'incomplete'))
cbind("#" = cc,
      "%" = round(100 * cc / nrow(NHANES), 2))
##              #    %
## complete    33  3.3
## incomplete 967 96.7
# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(NHANES))),
      "% NA" = round(sort(colMeans(is.na(NHANES))) * 100, 2))
##          # NA % NA
## race 0 0.0
## DM 0 0.0
## gender 0 0.0
## cohort 0 0.0
## age 0 0.0
## educ 1 0.1
## smoke 2 0.2
## occup 17 1.7
## wgt 22 2.2
## hgt 22 2.2
## BMI 37 3.7
## hypten 42 4.2
## WC 49 4.9
## DBP 59 5.9
## SBP 59 5.9
## HDL 86 8.6
## chol 86 8.6
## hypchol 86 8.6
## albu 91 9.1
## creat 91 9.1
## uricacid 92 9.2
## bili 95 9.5
## alc 319 31.9
## HyperMed 813 81.3
In the missing data pattern we could already see that some variables tend to be missing together. But there may be different types of relationships between variables that are of interest.
Our data contains hgt (height), wgt (weight) and BMI. Check the missing data pattern specifically for those three variables.
# Three solutions to choose from:
JointAI::md_pattern(NHANES[, c("hgt", "wgt", "BMI")], pattern = TRUE, plot = FALSE)
##      hgt wgt BMI Npat
## 1 1 1 1 963
## 2 1 0 0 15
## 3 0 1 0 15
## 4 0 0 0 7
## Nmis 22 22 37 81
with(NHANES,
     table(ifelse(is.na(hgt), 'height mis.', 'height obs.'),
           ifelse(is.na(wgt), 'weight mis.', 'weight obs.'),
           ifelse(is.na(BMI), 'BMI mis.', 'BMI obs.'))
)
## , ,  = BMI mis.
##
##
## weight mis. weight obs.
## height mis. 7 15
## height obs. 15 0
##
## , , = BMI obs.
##
##
## weight mis. weight obs.
## height mis. 0 0
## height obs. 0 963
plyr::ddply(NHANES, c(height = "ifelse(is.na(hgt), 'missing', 'observed')",
                      weight = "ifelse(is.na(wgt), 'missing', 'observed')",
                      BMI = "ifelse(is.na(BMI), 'missing', 'observed')"),
            plyr::summarize,
            N = length(hgt)
)

As we have already seen in the lecture, there are some cases where only either hgt or wgt is missing. BMI is only observed when both components are observed. To use all available information, we want to impute hgt and wgt separately and calculate BMI from the imputed values.
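Because BMI = wgt / hgt², it can be recomputed deterministically once hgt and wgt are imputed (in mice this approach is known as passive imputation). A self-contained sketch with illustrative values (kg and metres):

```r
# Illustrative values; BMI is NA whenever one of its components is missing
wgt <- c(46.3, 63.5, NA)
hgt <- c(1.55, 1.80, 1.68)
BMI <- wgt / hgt^2
round(BMI, 2)
```

Imputing the components and deriving BMI keeps the three variables consistent with each other, which direct imputation of BMI would not guarantee.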
The data contains variables on hypertension (hypten) and the use of medication to treat hypertension (HyperMed). We might expect that medication is only prescribed for patients with hypertension, hence, we need to investigate the relationship between HyperMed and hypten.
## hypten
## HyperMed no yes <NA>
## no 0 25 0
## previous 0 20 0
## yes 0 142 0
## <NA> 693 78 42
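A cross-tabulation like the one above, with NA as an explicit category, can be obtained with table() and its useNA argument; a self-contained sketch with hypothetical values:

```r
# Hypothetical factors with missing values
HyperMed <- factor(c("no", "yes", NA, NA), levels = c("no", "previous", "yes"))
hypten   <- factor(c("yes", "yes", "no", NA), levels = c("no", "yes"))
table(HyperMed, hypten, useNA = "ifany")   # NA shown as its own row/column
```

For the NHANES data, the analogous call would presumably be with(NHANES, table(HyperMed, hypten, useNA = "ifany")).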
Furthermore, we can expect a systematic relationship between the variables hypchol (Hypercholesterolemia) and chol (serum cholesterol). Find out how these two variables are related.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.070 4.220 4.780 4.755 5.380 6.180 86
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.21 6.39 6.70 6.95 7.16 10.68 86
It seems that hypchol is defined as chol > 6.2, which makes hypchol dependent on chol.
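If hypchol is indeed defined as chol > 6.2, it should later be derived from the (imputed) values of chol rather than imputed directly; a self-contained sketch with hypothetical values:

```r
# Hypothetical cholesterol values; a missing chol leaves hypchol missing
chol <- c(4.29, 6.21, NA, 7.10)
hypchol <- factor(ifelse(chol > 6.2, "yes", "no"), levels = c("no", "yes"))
hypchol   # "no" "yes" NA "yes"
```

Deriving hypchol this way prevents imputed values that contradict each other (e.g., hypchol = "yes" with chol below 6.2).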
Visualize the correlations between variables to detect associations that you may have missed.
# convert all variables to numeric
dat_num <- sapply(NHANES, as.numeric)
cormat <- cor(dat_num, use = 'pair', method = 'spearman')
corrplot::corrplot(cormat, method = 'square', type = 'lower')

Note: Correlations involving categorical variables are not usually done! We only use this as a quick-and-dirty method for visualizing relationships.
The question marks in the plot indicate that no correlation could be calculated for cohort (because cohort is the same for all subjects), nor between HyperMed and hypten (but we already knew that).
The plot shows that, besides the relations we were already aware of, wgt is strongly correlated with WC. Comparing the number of cases where only one of wgt and WC is missing shows that there are 18 cases where wgt is missing but WC is observed, and 45 cases where WC is missing but wgt is observed.
with(NHANES, table(ifelse(is.na(WC), 'WC mis.', 'WC obs.'),
                   ifelse(is.na(wgt), 'wgt mis.', 'wgt obs.')
))
##
## wgt mis. wgt obs.
## WC mis. 4 45
## WC obs. 18 933
Before imputing missing values it is important to take a look at how the observed parts of incomplete variables are distributed, so that we can choose appropriate imputation models.
Visualize the distributions of the incomplete continuous and categorical variables. The package JointAI provides the convenient function plot_all() for this.

Pay attention to the shapes of the distributions (for example, skewness in the continuous variables and sparsely filled categories), since they inform the choice of appropriate imputation models.
To learn more about additional options of plot_all() check out the vignette Visualizing Incomplete Data on the webpage of JointAI.
© Nicole Erler